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Abstract. Developmental transcriptional networks in plants and an- 
imals operate in both space and time. To understand these transcrip- 
tional networks it is essential to obtain whole-genome expression data at 
high spatiotemporal resolution. Substantial amounts of spatial and tem- 
poral microarray expression data previously have been obtained for the 
Arabidopsis root; however, these two dimensions of data have not been 
integrated thoroughly. Complicating this integration is the fact that 
these data are heterogeneous and incomplete, with observed expression 
levels representing complex spatial or temporal mixtures. Given these 
partial observations, we present a novel method for reconstructing in- 
tegrated high resolution spatiotemporal data. Our method is based on 
a new iterative algorithm for finding approximate roots to systems of 
bilinear equations. 



1. Introduction 

Transcriptional regulation plays an important role in orchestrating a host 
of biological processes, particularly during development (reviewed in [9lll3j). 
Advances in microarray and sequencing technologies have allowed biologists 
to capture genome-wide gene expression data; the output of this transcrip- 
tional regulation. This expression data can then be used to identify genes 
whose expression is correlated with a particular biological process, and to 
identify transcriptional regulators that coordinate the expression of groups 
of genes that are important for the same biological process. 

The identification of such genes and transcriptional regulators is compli- 
cated by the complex heterogeneous mixture of cell types and developmental 
stages that comprise each organ of an organism. Expression patterns that 
are found only in a subset of cell types within an organ will be diluted and 
may not be detectable in the collection of expression patterns obtained from 
RNA isolated from samples of an entire organ. Therefore techniques have 
been developed to enrich samples for specific cell types or developmental 
stages, especially for studies in plants [5]. In the model plant, Arabidopsis 
thaliana, several features of the root organ reduce its developmental com- 
plexity and facilitate analysis. Specifically, most root cell types are found 
within concentric cylinders moving from the outside of the root to the inside 
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of the root (Figure [T]). These cell type layers display rotational symmetry 
thus simplifying the spatial features of development. This feature has been 
exploited in the development of a cell type enrichment method. This enrich- 
ment method uses green fluorescent protein (GFP)-marked transgenic lines 
and fluorescently-activated cell sorting (FACS) to collect cell type enriched 
samples and has allowed for the identification of cell type-specific expression 
patterns [TJ [2] . Using this technique, high resolution expression data have 
been obtained for nearly all cell types in the Arabidopsis root (herein called 
the marker-line dataset) [H ITU]. 

Another feature that makes the Arabidopsis root a tractable developmen- 
tal model is that cell types are constrained in files along the root's longitu- 
dinal axis and most of these cells are produced from a stem cell population 
found at the apex of the root. This feature allows a cell's developmental 
timeline to be represented by its position along the length of the root. To 
obtain a developmental time-series expression dataset individual Arabidopsis 
roots were sectioned into thirteen pieces, each piece representing a develop- 
mental time point (herein called the longitudinal dataset) [3]. Each of these 
sections, however, contains a mixture of cell types, and the microarray ex- 
pression values obtained are therefore the average of the expression levels 
over multiple cell types present at these specific developmental time points. 

While the 19 fluorescently marked lines in Brady et al. [4] cover expression 
in nearly all cell types, they do not comprehensively mark all developmental 
stages of these cell types. Also, the procambium cell type was not measured, 
as a fluorescent marker-line that marks that cell type did not exist at the 
time. However, expression from the longitudinal dataset, does contain av- 
eraged expression of all cell types, and may be used to infer the missing cell 
type data. 

Previous studies have looked at separating expression data from the het- 
erogeneous cell populations that make up tumors into the contributions of 
their constituent cell types (HIES]- However, in that context, the difficulty 
comes from the fact that the mixture of cell types in each sample is unknown, 
whereas within our experimental context, the cell type mixture of each sam- 
ple is known. Two computational methods have been developed to combine 
the Arabidopsis longitudinal and marker-line datasets as experimentally re- 
solving this expression with marker lines is nearly impossible jH |6] . However, 
neither method takes all data into account when reconstructing expression. 
In jl], only high relative gene expression is considered, and in [6], no attempt 
is made to infer expression for cells not covered by any marker-line. 

In this work we formulate a model for expression levels in Arabidopsis 
roots in which cell type and developmental stage are independent sources of 
variation. The microarray data specifying overall expression levels for cer- 
tain mixtures of cells lead to an overconstrained system of bilinear equations. 
Moreover, due to the nature of the problem, we are exclusively interested in 
positive real solutions. We present a new method for finding non-negative 
real approximate solutions to bilinear equations, based on the techniques 
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of expectation maximization (EM) |15| Sec. 1.3] and iterative proportional 
fitting (IPF) [7] from likelihood maximization in statistics. Earlier work 
has used expectation maximization to find non-negative matrix factoriza- 
tions [TT], and our method is a generalization of that work. 

We applied our method to estimate spatiotemporal subregion expression 
patterns for 20,872 Arabidopsis transcripts. These patterns have identified 
gene expression in cell types and developmental stages which were previously 
unknown. A searchable database of verlays of these patterns on a schematic 
Arabidopsis root is under development and will be made publicly available 



at http : // www . arexdb . org 



2. Methods 

2.1. Expression data. Our method uses the normalized expression data 
collected in Expression levels were measured across 13 longitudinal sec- 
tions in a single root (longitudinal dataset) and across 19 different markers 
(marker-line dataset). For simplicity, the J2501 line was removed from fur- 
ther analysis as it is redundant with the WOODEN-LEG marker-line. The 
APL marker-line was also removed, as it contains domains of expression 
marked by both the S32 and the SUC2 marker-lines and adds no extra in- 
formation. The remaining 17 markers covering 14 cell types are listed in the 
second column of Table [TJ 

Due to computational constraints, the original normalization of this data 
was performed for the longitudinal and the marker-line datasets indepen- 
dently [4J. In order to account for differences caused by these separate 
normalization procedures, we adjusted the marker-line data by a global fac- 
tor of 0.92. This factor was calculated by comparing the expression values 
of ubiquitous, evenly expressed probe sets between the two datasets. We 
assume that by comparing these probe sets, any true expression differences 
due to cell type and longitudinal section specificity should be minimal and 
thus any differences in expression level is a byproduct of the separate nor- 
malization procedures. A set of 43 probesets were identified which were 
expressed ubiquitously (above a normalized value of 1.0 in all samples) and 
whose expression did not vary significantly among samples within a dataset 
(ratio of min/max expression within a dataset is at most 0.5). The scaling 
factor necessary to make the mean expression within the marker-line dataset 
equal to the mean expression within the longitudinal dataset was calculated 
for each probe set in this set. The median value of these 43 scaling factors 
was 0.92, which was used as the global adjustment factor (Table [5]). 

2.2. Model. To model the transcript expression level of an individual cell 
we assume that the effects of its cell type and its section on its expression 
level are independent of each other. More precisely, we assume that the 
transcript expression level of a cell of type j in section i is equal to the 
product Xi ■ yj, where X{ depends only on the section and yj depends only on 
the cell type. In other words, for each transcript, there is an idealized profile 
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Cell type 


Marker-lines 


Quiescent center 


AGL42, RM1000, SCR5 


Columella 


PET111 


Lateral root cap 


LRC 


Hair cell 


COBL9 (8-13) 


Non-hair cell 


GL2 


Cortex 


J0571, CORTEX (7-13) 


Endo dermis 


J0571, SCR5 


Xylem pole pericycle 


WOL (2-9), J0121 (9-13), J2661 (13) 


Phloem pole pericycle 


WOL (2-9), S17 (8-13), J2661 (13) 


Phloem 


S32, WOL (2-9) 


Phloem companion cells 


SUC2 (10-19), WOL (2-9) 


Xylem 


S4 (2-7), S18 (8-13), WOL (2-9) 


Lateral root primordia 


RM1000 


Procambium 


WOL (2-9) 



Table 1 . The 14 cell types in the Arabidopsis root and the 
17 marker-lines which mark them [3J. For markers that only 
mark the cell type in some of the sections, these sections are 
indicated by the range in parenthesis. 



of expression over different cell types, and an idealized profile of expression 
over different sections. Within a given section, our assumption is that the 
transcript expression level varies proportionally to its cell type profile, and 
within a given cell type, proportionally to its longitudinal profile. 



Each microarray sample in the two datasets (described in Section 2.1), 
is composed of a distinct mixture of cell types and sections. Within each 
sample, the measured transcript expression level is a convex linear com- 
bination of the expression levels of its constituent cells. Under the above 
assumptions, these measurements constitute a system of bilinear equations, 

13 14 

(1) ^^cLijkXiyj = b k for k = 1,..., 30 

i=i j=i 

where X{ and yj are the model parameters for the 13 sections and 14 cell 
types respectively, and b k is the measured expression level as k ranges over 
the 30 measured samples (13 longitudinal sections and 17 markers). 

The coefficients ayfc are obtained by combining the cell type by marker 
line data (Table [T]) with the the cell-count matrix (Table |2j). For k at most 
13, the measurement with index k comes from the kth longitudinal section. 
We will use a„fc to denote the corresponding matrix, with in the ith 
row and jth column. We set this matrix to be zero everywhere except the 
/cth row, where it is proportional to the /cth row of the cell-count matrix, but 
rescaled to sum to 1. For k greater than 13, the measurements come from 
one of the 17 marker-lines. The matrix is likewise zero except for those 
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Table 2 . The cell count matrix gives the number of cells in 
each spatiotemporal subregion. The 13 rows correspond to 
longitudinal sections 1 through 13. From left to right, the 
14 columns correspond to the following spatiotemporal sub- 
regions: quiescent center, columella, lateral root cap, hair 
cell, non-hair cell, cortex, endodermis, xylem pole pericy- 
cle, phloem pole pericycle, phloem, phloem companion cells, 
xylem, lateral root primordia, and procambium. 



spatiotemporal subregions marked by that marker as indicated in Table [T] 
Note that the non-zero entries of a**k may span multiple columns for those 
markers which are listed in multiple rows of Table [T] The non-zero entries 
of a**k are proportional to the corresponding entries of the cell matrix, but 
rescaled to sum to 1. 

2.3. Cell matrix. As described in the previous section, the coefficients a^-fc 
in our model depend on the number of cells in each spatiotemporal subre- 
gion. These cell number estimates were generated by visual inspection of 
successive optical cross-sections of Arabidopsis roots along the longitudi- 
nal axis using confocal laser scanning microscopy. For the xylem, phloem 
and procambium cell types, cell counts were obtained from earlier exper- 
iments [31 114] . What follows is a detailed description of this visual and 
literature analysis. These results are also summarized in Table 2. 

Longitudinal section 1 encompasses two tiers of 12 columella cells, and 
three tiers of lateral root cap cells (15, 18 and 18 moving up from the tip). 

Longitudinal section 2 contains one tier of 12 columella cells and six tiers 
of lateral root cap cells (20, 20, 28, 28, 28 and 28 moving up from the 
tip). For all other cell types in longitudinal section 2, three tiers of cells 
are present. Eight trichoblast (hair cell precursor) cells and 16 atrichoblast 
(non-hair cell precursor) cells are present circumferentially throughout the 
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root, resulting in 24 and 48 cells respectively in the hair cell and non-hair 
cell precursor files in longitudinal section 2. Throughout the root, eight 
cortex and eight endodermis cells are present circumferentially. However in 
longitudinal section 2, the cortex/endo dermis initial is undergoing assymet- 
ric periclinal divisions to produce the cortex and endodermis cell files, so we 
consider there to be approximately 0.5 cells of the cortex and endodermis 
type, resulting in 12 cells of each type in longitudinal section 2. When the 
Arabidopsis root is seven days old, each longitudinal section from 3-13 con- 
tains approximately five cells of each type along the root's longitudinal axis. 

In longitudinal section 2, the tangential and periclinal divisions that give 
rise to phloem cell files do not occur, but do occur in longitudinal sec- 
tion 3 [3] . Three cells are present in the main xylem axis in the first tier of 
cells, four cells in the second tier, and five cells in the third tier p3]. Eight 
procambial cells are present in the first cell tier, 12 procambial cells in the 
second tier, and 18 cells in the third tier resulting in 28 procambial cells in 
longitudinal section 2 [14 . For all sections xylem pole pericycle cells are the 
two cells that flank the xylem axis on either end, and phloem pole pericycle 
cells are considered the intervening cells. Four pericycle cells can be identi- 
fied as flanking xylem cells in all three tiers of cells present in longitudinal 
section 2 [2] . Seven intervening phloem pole pericycle cells can be found in 
tier one, and eight intervening cells can be identified in the third tier |14j . 
resulting in 22 procambial cells in longitudinal section 2. 

In a seven day old root, each of the longitudinal sections 3-13 contains 
approximately five tiers of cells. In longitudinal section 3, columella cells 
can no longer be identified, and 10 tiers of lateral root cap cells exist con- 
taining 28 cells each. In sections 4-6, a lateral root cap cell is twice the 
length and half the width of an epidermal cell. Eighty-four cells were identi- 
fied in each tier, and two and a half tiers of cells exist each for longitudinal 
sections 4-6 resulting in 210 cells for each longitudinal section. All other 
cell types have undergone the appropriate tangential and periclinal divi- 
sions to establish their respective cell files by longitudinal section 3. Two 
protophloem cells, two metaphloem cells and four accompanying compan- 
ion cells are present in the phloem tissue [3J. With the combination of 
protophloem and metaphloem cells, 20 phloem cells and 20 companion cells 
exist in each longitudinal section. Approximately 40 procambial cells exist in 
each longitudinal section. Secondary cell growth does not occur in the devel- 
opmental stages sampled, therefore, this number remains fixed throughout 
all developmental stages. In longitudinal section 12, a non-emerged lateral 
root is hypothesized to be present based on microarray expression data [4|. 
This lateral root is estimated to be approximately 130 cells, or one tier of 
cells in longitudinal section 2. 

In our modelling the distinct vasculature, protophloem and metaphloem 
cell types were treated as a single cell type, as no marker-line was specific 
enough to differentiate clearly between these cell types. Also, the metaxylem 
and protoxylem were considered as a single cell type by the same rationale. 
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2.4. Solving Bilinear Equations. In this section we present our method 
for solving the system of bilinear equations given by ([I]) . More generally, we 
have a system 



(2) f(x, y) -—y^y] aijkXjyj = b k for k = 1, . . . ,£. 

i=l j=i 

In our application, we have n = 13, m = 14, and t = 30. Unlike other nu- 
merical methods for solving systems of polynomial equations, our algorithm 
has the advantage that it finds only non-negative, real solutions. Moreover, 
even in systems where there are no exact solutions, as will generally be the 
case in an overconstrained system of equations, our method will find ap- 
proximate solutions. A more detailed, technical mathematical study of the 
method will be available in a forthcoming paper by the first author. 

Our method is based on the Expectation-Maximization (EM) |15l Sec. 
1.3] and Iterative Proportional Fitting (IPF) [7 a algorithms used for maxi- 
mum likelihood estimation in statistics. These are iterative algorithms which 
reduce the modified Kullback-Leibler divergence at each step: 

(3) D(f(x, y) \\b) = log (jJ^Tyj) ~ b " + V)) • 

The traditional Kullback-Leibler divergence consists only of the first term 
in the summation. The other two terms are a natural generalization, which 
is necessary only when the vectors f(x,y) and b do not sum to one |llj . 

Our algorithm begins with an arbitrarily chosen starting point (x^°\ y^) 
in M>q~"'. In each iteration s, the expectation step computes the quantities: 

0) 0) 

2^i'=i 2_/j'=i a %'j'kX i i Vj/ 

(s) 

for all i, j, and k. This quantity w\- k is an estimate of the contribution 
of the term in the fcth equation in (|2]). The maximization step is an 
analogue of the IPF algorithm, and itself consists of an iteration. We first 
compute the analogues of the sufficient statistics: 



j=l k=l 
n I 
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Then we perform an iteration beginning with xf 1 '^ = xf^ and yf = yf^ 
and the update rules 



>,*+!) ._ Js,t) A i 



x} := x- 



0,*)„>,t) 



Em v^* \ s ' 1 ' 
j=l 2^k=l a ijk x i V 



3 



(s,t+l) ._ (s,t) ' ■ 



2_^j=i Z^k=i a tjkx i Uj 

until the parameters converge. We then re-normalize and use the values 
from the last index t for the next step of the EM algorithm: 

(s,t) 

_(«+!) - = X i 

Z-ji'=l x i' 

m 



s+1 s,t fa. 

y ) y ) x i' 



At each step of each of these algorithms, the Kullback-Leibler divergence 
defined in Q decreases. In the statistics literature, the convergence of the 
EM and IPF algorithms is known under the additional assumptions that 

n m I 

i=l j=l k=l 

However, relaxing these conditions does not change the convergence proof. 

We repeatedly ran our EM algorithm beginning with 20 different ran- 
domly chosen starting points. For each transcript in the data, all 20 runs of 
the algorithm converged to the same solution, strongly suggesting that we 
have found a global minimum to the modified Kullback-Leibler divergence. 

2.5. Computational validation methodology. In order to validate our 
method, we simulated expression profiles according to various models and 
tested our method's ability to reconstruct the underlying parameters. First, 
we simulated data according to the same independence model defined in 



Section 2.2 The underlying spatiotemporal subregion expression levels were 
sampled from a log-normal distribution with standard deviation 0.5. The 
simulated measurements bt were computed from these subregion levels ac- 
cording to our model of the Arabidopsis root in ([T]). Finally, multiplicative 
error was added, distributed according to a log-normal distribution with 
standard deviation 0.03 to simulate measurement noise. This procedure cre- 
ated expression data with varying but comparable expression levels, which 
we will call the "uniform" dataset. However, since we are particularly in- 
terested in genes for which the expression levels are not uniform, we also 
produced simulations with the expression level for a given section or cell type 
raised by a factor of 10, which we will call the "elevated" dataset. In this 
dataset, we only measured the error for the same section or cell type which 
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was elevated. These simulations measure our ability to detect a dominant 
expression pattern. 

In addition, we designed simulations that test the robustness of the algo- 
rithm to failures of the bilinear model for root expression levels. For each 
section and cell type, we simulated data in which the expression levels for 
cells in that section or cell type did not follow the bilinear model, and call 
these the "section" and "cell type" datasets respectively. Instead, the ex- 
pression levels in the given section or cell type were chosen independently 
according to a log-normal distribution with standard deviation 0.5\/2- The 
factor of y/2 was introduced because the product of two log-normally dis- 
tributed numbers with standard deviation 0.5 is distributed log-normally 
with standard deviation 0.5\/2- 

The predictions were compared to the true expression levels across the 
spatiotemporal subregions within each section and each cell type. For each 
section and each cell type, the expression levels in its spatiotemporal subre- 
gions were averaged, ignoring those combinations which are not physically 
present in the root, (i.e. those whose entry in Table [2] is 0). The difference 
between the predicted and true average expressions was computed as a pro- 
portion of the true average expression. We then computed the root mean 
square of the proportional error over 500 simulations. 

2.6. Visualization of predicted expression patterns. Predicted ex- 
pression values were colored according to an Arabidopsis root template (Fig- 
ure 0. The green channel of each cell was set according to a linear mapping 
between the expression range shown in the template [1,10] or [1,5] to the 
range [0,255]. Expression values above or below that range are given values 
of 255 or respectively. The mapping is also shown to the right of the 
false color image in the form of a gradient key. Phloem cells by longitudinal 
section are visualized separately on the right hand side of the root as they 
are physically occluded by other cells in the left hand side representation. 
The minimum and maximum range of expression value visualized can also 
be adjusted by the user. 

2.7. In vivo validation methodology. To validate predicted expression 
values, we used transgenic Arabidopsis thaliana lines containing transcrip- 
tional GFP fusions in the Columbia ecotype [12]. For each gene being 
validated, six plants from at least two insertion lines previously described 
as expressing GFP were characterized. All plants were grown vertically 
on IX Murashige and Skoog salt mixture, 1% sucrose and 2.3 mM 2-{N- 
morpholino)ethanesulfonic acid (pH 5.7) in 1% agar. Seedlings were pre- 
pared for microscopy at 5 days of age. Confocal images were obtained using 
a 25x water-immersion lens on a Zeiss LSM-510 confocal laser-scanning mi- 
croscope using the 488-nm laser for excitation. Roots were stained with 10 
//g/mL propidium iodide for 0.5 to 2 minutes and mounted in water. GFP 
was rendered in green and propidium iodide in red. Images were saved in 
TIFF format. Images were manually stitched together in Adobe Photoshop 
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CS2 using the Photomerge command. The black background surrounding 
the root was modified to ensure uniformity across figures. No other image 
enhancement was performed. 

3. Results 

3.1. Computation validation. The root mean square percentage errors 
in the reconstruction of each parameter are shown in Table [3j In the first 
two columns, where the data were generated according to the bilinear model, 
the error rate is generally no greater than the simulated measurement error. 
In most cases, elevated expression led to a lower error rate. In particular, 
reconstruction of expression in procambium was much more accurate in the 
elevated dataset. 

The last two columns show that the algorithm is robust to violations of 
the bilinear model. Also, the predicted expression level in each cell type is 
generally not greatly affected by the failure of the model in other cell types, 
and similarly with sections. 

3.2. In vivo validation. To determine whether our algorithm is able to 
accurately resolve spatiotemporal subregion-level transcript expression val- 
ues, it would be ideal to compare the predictions to measured microarray 
expression values of the same spatiotemporal subregion. However, due to 
technical constraints, it is not possible to measure mRNA expression to such 
a degree of specificity and thus we cannot validate the estimates directly. In- 
stead, we validated the method by visually comparing the predicted pattern 
of expression to patterns obtained from transcriptional GFP fusions using 
laser scanning confocal microscopy, as described in |12j . 

For each gene validated, a false-colored root image was generated by col- 
oring each spatiotemporal subregion of an annotated Arabidopsis root tem- 
plate (Figure [TJ according to the expression level in that subregion as pre- 
dicted by our method. This false-colored image was then visually compared 
against the actual pattern of fluorescence observed in plants expressing a 
transcriptional GFP fusion specific for the promoter of that gene. These 
transcriptional GFP fusions contain up to 3 kb of regulatory sequence up- 
stream of the translational start site of the respective gene. In many cases, 
this sequence is sufficient to recapitulate endogenous mRNA expression pat- 
terns as defined by cell type resolution microarray data [12]. This compara- 
tive method of validation allows us to assess the accuracy of spatiotemporal 
subregion expression predictions in an efficient and technically feasible way. 

As a benchmark validation test, a set of three transcriptional fusions 
which were used to obtain some of the marker-line dataset were examined: 
S18(AT5G12870), S4(AT3G25710), and S32(AT2G18380). These fusions 
were originally selected for use in profiling because they exhibited enriched 
cell type expression as observed by laser scanning confocal microscopy and 
subsequently confirmed in the microarray expression data. The expression 
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Error rate 
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1.6 


3.6 


3.1 


Hair cell 
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2.8 


9.1 


4.3 


Non-hair cell 


3.0 


2.1 


3.1 


3.0 


Cortex 


2.9 


2.1 


6.9 


3.6 


Endo dermis 


2.8 


2.2 


3.5 


3.2 


Xylem pole pericycle 


3.3 


3.1 


10.8 


4.9 


Phloem pole pericycle 


3.0 


2.9 


9.4 


4.9 


Phloem 


3.0 


2.9 


3.0 


3.0 


Phloem ccs 


3.3 


3.4 


11.7 


4.9 


Xylem 


2.2 


2.1 


2.5 


2.2 


Lateral root primordia 


3.5 


3.0 


3.4 


3.3 


Procambium 


8.3 


1.8 


12.7 


12.7 



Table 3. Root mean square percentage error rates in the 
reconstruction of simulated data. The first column is under 
a model of comparable but varying expression levels across 
all sections and cell types. The second type is the error rate 
when that section or cell type has its expression level raised 
by a factor of 10. The third and fourth columns show models 
in which the bilinear assumption is violated in one of the 
sections or one of the cell types respectively. In all cases, 3% 
measurement error has been added to the expression levels. 



predictions from our method accurately recapitulated the observed pattern 
of all three benchmark genes (Figure [2] and data not shown). 

To assess the novel predictive ability of our method to reconstruct in vivo 
expression patterns given missing data, we selected transcriptional fusions 
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for genes for which our method predicts expression in cell types or in spa- 
tiotemporal subregions that were not marked by fluorescent marker-lines in 
the original dataset. At least two lines per transcriptional fusion were mon- 
itored. With respect to an unmarked cell type, we selected a candidate gene 
predicted be our model to be highly expressed in the columella and develop- 
ing procambium. Imaging of a transcriptional fusion of this gene confirmed 
this expression (Figure [3]). 

We next determined if our method could correctly differentiate expression 
in a specific developmental stage of a cell type. The collection of marker-lines 
used to generate the original dataset included a marker for all developmen- 
tal stages of non-hair cells, composed of their precursors (atrichoblasts) and 
fully developed non-hair cells. However, the marker-line used for hair cells 
only marks mature hair cells, and not their precursors (trichoblasts). Using 
predictions from our method we tested a candidate gene with predicted ex- 
pression throughout the epidermis — in mature hair cell, trichoblast, mature 
non-hair cell and atrichoblast cell files — with higher expression predicted in 
non-hair cells than in hair cells. This differential expression was validated 
using a transcriptional fusion (Figure [4]) demonstrating that our method is 
not only able to identify expression in a developmental stage of a cell type 
not marked by the marker-line data, but also to accurately differentiate rel- 
ative levels of a transcript. However, it should be noted that expression in 
the transcriptional fusion did not fully corroborate the expression predicted 
by our algorithm — specifically, expression was found in the lateral root cap 
which was not predicted by our algorithm. 

Examination of the raw microarray expression data revealed that expres- 
sion was not elevated in the lateral root cap in the input microarray data. 
Most likely, the presence of GFP is not indicative of erroneous reconstruction 
of gene expression in this case. Instead, the transcriptional fusion does not 
contain sufficient regulatory elements to direct the appropriate expression as 
described in [12], perhaps within downstream sequences. For this reason, a 
comparison of the ratio between raw marker line and section expression data 
can be obtained as a link for each gene so that the user can simultaneously 
assess raw expression data with the reconstructed expression patterns. 



4. Discussion 

We have shown that spatiotemporal patterns of gene expression in the 
Arabidopsis root can be reconstructed using information from the marker- 
line and longitudinal datasets. Current experimental techniques are limited 
in their ability to rapidly and accurately microdissect organs into all com- 
ponent cell types at all developmental stages. Our computational technique 
helps to overcome these limitations. We fully integrate the marker-line and 
longitudinal data sets into a comprehensive expression pattern, across both 
space and time. In particular, this method has enabled the identification 
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of Arabidopsis root procambium and trichoblast-specific genes, which have 
been previously experimentally intractable cell types. 

Our high-resolution expression patterns will allow us to better understand 
the regulatory logic that controls developmental processes of the Arabidopsis 
root. These transcriptional regulatory networks are key to understanding 
developmental processes and environmental responses. With only a portion 
of these genes and fewer cell types, high-resolution spatiotemporal data has 
been used to identify transcriptional regulatory modules 0]. Our more ac- 
curate and complete dataset will allow a more comprehensive discovery of 
regulatory networks across additional cell types. 

Moreover, we expect that our algorithm and the model which underlies it 
are applicable to time course experiments on other heterogeneous cell mix- 
tures. Measurements in multicellular organisms are taken from complex cell 
mixtures of organs, tissues, heterogeneous cell lines, or cancerous samples. 
When precise histological characterization of these samples can estimate 
underlying cell type composition, our method can be used to reconstruct 
the underlying cell type-specific gene expression patterns or any other type 
of quantitative data, such as high-throughput protein abundance measure- 
ments. Theoretically, this algorithm can be applied to identify missing data 
in any experimental system that captures data in two or more dimensions 
which are assumed to be independent of one another. 
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Probe 


Gene(s) 


Longitudinal 


Marker-line 


Ratio 


246630_at 


AT1G50730 


1.073 


1.239 


0.866 


246980_at 


AT5G67530 


1.676 


1.605 


1.045 


247163_at 


AT5G65685 


1.423 


1.508 


0.944 


247286_at 


AT5G64280 


1.146 


1.292 


0.887 


247334_at 


AT5G63610 


1.75 


1.497 


1.169 


247363_at 


AT5G63200 


1.225 


1.305 


0.938 


248266_at 


AT5G53440 


2.514 


1.867 


1.347 


250524_at 


AT5G08520 


1.691 


1.54 


1.099 


250791_at 


AT5G05610 


1.151 


1.334 


0.863 


251104_at 


AT5G01720;AT5G01715 


1.214 


1.257 


0.966 


251233_at 


AT3G62800 


1.171 


1.204 


0.973 


252157_at 


AT3G50430 


1.398 


1.21 


1.155 


253409_at 


AT4G32960 


1.406 


1.572 


0.894 


253565_at 


AT4G31200 


1.461 


1.326 


1.101 


253826_s_at 


AT4G27960;AT5G53300 


26.884 


26.78 


1.004 


255253_at 


AT4G05000 


1.167 


1.696 


0.688 


255704_at 


AT4G00170 


1.35 


1.853 


0.729 


255725_at 


AT1G25540 


1.507 


1.636 


0.921 


255838_at 


AT2G33490 


1.691 


1.473 


1.148 


255946_at 


AT1G22020 


1.335 


1.481 


0.901 


256236_at 


AT3G12350 


1.293 


1.978 


0.654 


256907_at 


AT3G24030 


1.601 


1.751 


0.915 


256961_at 


AT3G13445 


1.383 


1.441 


0.96 


258269_at 


AT3G15690 


1.794 


1.489 


1.205 


259243_at 


AT3G07565 


1.678 


1.768 


0.949 


259280_at 


AT3G01150 


1.302 


1.941 


0.671 


259313_at 


AT3G05090 


1.275 


1.642 


0.776 


259341_at 


AT3G03740 


1.342 


1.606 


0.835 


259800_at 


AT1G72175 


1.159 


1.4 


0.828 


260133_at 


AT1G66340 


1.278 


1.636 


0.781 


261348_at 


AT1G79810 


1.16 


1.464 


0.792 


261515_at 


AT1G71800 


1.236 


1.366 


0.905 


261634_at 


AT1G49970 


1.468 


1.732 


0.847 


261666_at 


AT1G18440 


1.258 


1.297 


0.97 


261744_at 


AT1G08490 


1.265 


1.384 


0.914 


262089_s_at 


AT1G56000;AT1G55980 


1.554 


1.194 


1.301 


262379_at 


AT1G73020 


1.15 


1.398 


0.823 


262672_at 


AT1G76050 


1.306 


1.356 


0.963 


262860_at 


AT1G64810 


1.412 


1.39 


1.016 


263984_at 


AT2G42670 


1.327 


1.314 


1.01 


264307_at 


AT1G61900 


1.273 


1.718 


0.741 


265129_at 


AT1G30970 


1.333 


1.411 


0.945 


267401_at 


AT2G26210 


1.198 


1.476 


0.812 



Table 4. Mean expression values and scaling factors of ubiq- 
uitously, evenly expressed probesets across longitudinal and 
marker-lines 
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Figure 1. Arabidopsis root template used for expression 
pattern overlays. 
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Figure 2. (A) Expression of AT2G18380 in all developmen- 
tal stages of the phloem was predicted by our method and vi- 
sualized in a representation of the Arabidopsis root. Phloem 
cells are shown external to the root. (B) GFP expression in 
the longitudinal axis and (C) expression in cross-section of 
expression driven by the AT2G18380 promoter validate the 
prediction. 
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Figure 3. (A) Our method correctly predicts specific ex- 
pression of a candidate gene in a cell type, procambium, 
that is only covered by a general tissue marker, WOL. Ex- 
pression conferred by the candidate gene promoter fused to 
GFP as a reporter was visualized in the columella (B) and 
in the procambium by a longitudinal section (B) and a cross 
section (C). The label X indicates the xylem axis. The ex- 
pression also validates a maximal peak in the meristematic 
zone. 



RECONSTRUCTING SPATIOTEMPORAL GENE EXPRESSION DATA 19 




Figure 4. (A) Our method correctly predicts candidate 
gene expression in trichoblast cells in the meristematic zone, 
which are not currently covered by any marker-lines. Fur- 
thermore, the algorithm was able to predict differential ex- 
pression within epidermal tissue with high expression in non- 
hair cells and atrichoblasts (immature non-hair cells in the 
meristematic and elongation zone) and decreased expression 
in hair cells or trichoblasts (immature hair cells in the meris- 
tematic and elongation zone). Expression conferred by the 
candidate gene promoter fused GFP as a reporter was visu- 
alized in the epidermis in a longitudinal section (B) and was 
specifically identified as high in atrichoblasts, and lower in 
trichoblasts (marked with an asterisk) in cross-section (C). 
Trichoblasts or hair cells differentiate at the junction between 
two underlying cortical cells. 



